TAZ, a transcriptional coactivator present in the Hippo signaling pathway, has been also identified as a downstream element ofthe Wnt signaling pathway, where the main transcriptional coactivator is \(\beta\)-catenin. When the Wnt pathway is activated, the APC and the so-called \(\beta\)-catenin destruction complex are inhibited, stabilizing both \(\beta\)-catenin and TAZ and allowing their location to the nucleus to activate their respective target genes.
The objective of this study made by Azzolin et al[^1^](#1) is determine the role of TAZ proteĆÆn in the WNT signaling pathway and whether TAZ relates with this pathway independently of bCatenin or not.
The study analysed in this exercice looks for the role of TAZ protein as a potenital mediator of Wnt signaling and was performed by Azzolin et al [^1^](#1). It consist in a comparison analysis (class comparison) with 16 samples and 4 conditions with 4 biological replicates each, all performed in human SW480 cell line, which is a colorectal cancer cell line:
The data was obtained using one-color expression microarrays HG-U133_Plus_2 from Affymetrix and can be retrieved from Gene Expression Omnibus (GEO) under the identificator GSE39904.
The study has other data, but for this exercise, only the microarray data presented above will be considered.
Figure 1._Workflow of the analysis_. Black text within boxes represents the steps of the analysis, blue text represents the functions used in each step of the analyses and black text outside of boxes represents the inputs and outputs of each function.
The corresponding code used and all the analyses can be consulted in the appendix code section. By clicking in each step, the code from each step can also be consulted.
The steps of the analysis are:
Setup of the raw data: Download the data from the Gene Expression Omnibus in .cel file format and create a targets.tsv file with the information of each sample (name, descriptive name, conditionsā¦). The data identificator in GEO is GSE39904.
Data reading. Use the function readAffy() from the R package affy to read the .cel raw data alogn with the targets.tsv file.
Quality control of the raw data: Perform different quality controls of the raw data:
Physical control: using immage() from affy, images with the fluorescence of each probe in each microarray have been obtained. They are useful to tell if there is any physical damage in the microarrays.
Quality stats: the function qc() from simpleaffy returns the percentage of probes where fluorescence was detected, the background fluorescence and the 3ā/5ā ratios of cDNA for actin and GAPDH.
Intensity boxplot: the function boxplot() from affy returns a boxplot with the luminscence distribution of each sample. The luminiscence is proportional to the gene expression in each probe and it is assumed that the distribution should be similar across all samples.
Principal component analysis: the principal component analysis studies the differences between the samples. The function plotPCA3() to compute the PCA has been retrieved from the Github repository of the microarray course.
Preprocessing of the raw data: Process and normalize the raw data obtained in the step 2 by using a Robust Multyarray Analysis (RMA) which corrects the background fluorescence, normalizes the raw data (log2 normalization) and estimates the expression of the genes. The function used for the RMA is rma() from the package affy.
Quality controls of processed data: Perform quality controls of the processed data obtained in the step 4. Only the intensity boxplot and the principal component analysis have been performed for the normalized data.
Filtering: filter the processed data from step 4 has been performed to get rid of the least variable genes. To perform it, function nsFilter() from the R package genefilter has been used. The filtering has performed because the differential expression analysis is influenced for the number of genes we look at: if the number of genes increases, the probability of false discoveries also increases and the need of p-value correction is greater. Hence, since if a gene is differentially expressed, it can be assumed that is variable between samples thus, the low-variable genes accross samples have been filtered.
DIFFERENTIAL EXPRESSION ANALYSIS
Setup of the DE analysis: from the filtered data obtained in step 6, the expression levels of each gene in each sample have been obtained using the function exprs() from the package affy. Then, using model.matrix() a design matrix is created to show which samples go to each condition.
Model fitting (1): taking both, the desing matrix and the expression levels obtained in the step 7, and using the function lmFit() from the package limma the data is fitted to a regression model that will be used to determine the differentially expressed genes. Then, with makeContrasts() a contrast matrix with the conditions which will be compared in each contrasts is created.
Model fitting (2): with the model and the contrast matrix obtained in the step 8 and using the function contrasts.fit() from limma the estimated coeficients for each test (each comparison between genes) and the standard errors are computed. Then using eBayes(), the t- and F-statistics are computed, as well as the log-odds of differential expression (similar to the probability of a gene of being differentially expressed).
DE information: After runing the function eBayes(), topTable() has been used to retrieve the information of each contrast (ProbeID, p-value, adjusted p-value, log2FCā¦). It is important to indicate a p-value correction method (in our case is Benjamini-Hochberg) in order to get rid of the error produced by the multiple testing.
Annotation of DEGs: using the custom function getDegs(), genes has been annotated as up or downregulated (indicating the thresholds of log2FC and adjusted p-value) and gene symbol, entrez ID and gene name for each probe ID has been retrieved
Visualization of DEGs: using the custom function volcanoPlot2(), DEGs have been ploted using their log2(FC) in the X axis and the -log10(adjustedPvalue) in the Y axis. The 5 most upregulated and 5 most downregulated genes are annotated in the plot.
GO analysis: using the package clusterProfiler, the GO terms (for biological function) of upregulated and downregulated genes have been obtained in order to see which biological functions are the DEGs related to.
This study can be retrieved on github: https://github.com/amitjavilaventura/OmicsDataAnalisis-PEC1-Microarrays
The quality of the raw data os correct:
The quality of the data improves greatly after the RMA step:
After the filtering step, only 5044 genes remained from a total of 54675, indicating that the majority of the genes have small variation across all samples.
In healthy conditions, when Wnt signaling pathway is activated, the APC proteĆÆn is inhibited, which allows the stabilization \(\beta\)-catenin impeding its degradation. Then, the \(\beta\)-catenin goes to the nucleus and binds to TEF/LEF protein family and initiates the transcription of the targeted genes. On the other hand, when the Wnt pathway is inhibited, the APC proteĆÆn forms, along with other proteins, the so-called \(\beta\)-catenin destruction complex. This complex phosphorilates \(\beta\)-catenin, which goes to the proteasome to be degraded. It has been shown that TAZ, which is an intermediate of the Hippo pathway, is also degraded along with the \(\beta\)-catenin when the Wnt pathway is inactive [^1^](#1).
In our cell line (control), as in most cancer cell lines, the APC gene is mutated and it has lost the function, hence the Wnt pathway is active and its target genes are transcribed, allowing high cell proliferation rate.
If we knock-down bCAT with a siRNA in our cell line, its targets genes wonāt be transcribed anymore, hence the Wnt pathway will be less active and the gene expression will be more similar to those cells expressing WT APC. The same goes to the TAZ siRNA-mediated knock-down but, in this case, the gene expression wonāt be necessarily more similar to the expression of those cells expressing WT APC, since TAZ is also related to the Hippo pathway.
The comparisons that will be performed are shown in the following table:
| APC_WT - Control | siTAZ - Control | siBCAT - Control | siTAZ - APC_WT | siBCAT - APC_WT | siBCAT - siTAZ | |
|---|---|---|---|---|---|---|
| Down | 550 | 261 | 358 | 604 | 453 | 509 |
| NotSig | 4203 | 4606 | 4366 | 3590 | 3893 | 4045 |
| Up | 291 | 177 | 320 | 850 | 698 | 490 |
Figure 2. Volcano plots showing the up and downregulated genes for all the contrasts
As expected, the number of DEGs in the contrast of the APC WT vs control is quite high (Table 2), which indicates that both groups are different and that our cell line has lost the function of its APC protein or, at least that APC is performing worse than in healthy conditions.
Figure 3. Most enriched GOterms for the DEGs of āAPC WT vs Controlā
After restoration of APC activity (APC WT group), a high number of genes has been expressed differentially. This suggest a great difference between the two groups which is mainly caused by the different versions of the APC gene indicating that the SW480 cell line carries a non functional APC gene.
Among the most upregulated DEGs there are the HLA-DRA and HLA-DPA1 genes (Fig. 2), both related to the immune system. As it is shown in Fig. 3, some of the most enriched GO terms for the upregulated genes are related to pathways of the immune system like the response to bacteria, response to LPS, as well as responses to proteins such as interferon gamma.
Paradoxally, the downregulated genes after the APC restoration participate in some biological functions related to the Wnt signaling pathway, but also to its negative regulation. This is the contrary to what could be expected, because the APC loss of function in the control group activates the Wnt pathway and the APC restoration should upregulate (and not downregulate) the genes that inhibits the pathway.
As expected, in the TAZ knock-down vs control, the number of DEGs is much lower than in the contrast of bCAT knockdown vs control (Table 2). This could be because in the control group, the WNT pathway is activated but if we knock-down bCAT, which is the main effector of the pathway, the target genes wonāt be transcribed. Contrarily, if we knok-down TAZ, some of the WNT target genes may be affected, but not as much as bCAT.
limma::vennDiagram(res.selected[,2:3], include = c("up", "down"), circle.col = c("red", "blue"), show.include = T)
title("DE genes in common between the 'TAZ-KD vs Control' and 'bCAT-KD vs Control'\n The upper number represents the upregulated genes and the lower number represents the downregulated genes")
Figure 3. DE genes in common between the āTAZ-KD vs Controlā and ābCAT-KD vs Controlā Genes selected with FDR < 0.05 and logFC > 1
Between the contrasts of siTAZ vs Control and siBCAT vs Control, the number of shared DEGs is very low compared to the total of DEGs in both contrasts (Fig. 3). Moreover, the 5 most upregultated and five most downregualted genes are completely different between contrast (Fig. 2).
This suggests that the majority of target genes of TAZ and bCAT are different.
Figure 4. DEG-related GO terms in āTAZ-KD vs Controlā and ābCAT-KD vs Controlā
Curiously, there are no enriched GOterms for the upregulated genes in the contrast TAZ-KD vs Control and for those downregulated genes of the same contrast, there is only one enriched GOterm which refers to the innate immune response of mucosa. Contrarily to this, the number of enriched GOterms in both up and downregulated genes of the contrast bCAT-KD vs Control is very high.
So, not only the target genes of TAZ and bCAT are different, also the biological function of those genes are completely different.
If TAZ protein levels rise after Wnt pathway activation, upon APC mutation TAZ levels should be at its maximum. This fact could be due to two different possibilites:
If the first hypothesis is correct, upon APC restoration the TAZ gene should be downregulated, hence in the contrast of APC WT vs Control TAZ should be among the downregulated genes, which can be retrieved in the file results/degs/apcwt_vs_control_down.tsv. However TAZ gene is not among the downregulated genes of the contrast, which means that the Wnt pathway does not activate TAZ transcription. Thus, the high levels of TAZ protein upon Wnt activation must be due to another process.
To proof the second hypothesis (which is the one stated by Azzolin et al. [^1^](#1)), we would need proteĆÆn data, along with other mRNA data. However, if TAZ protein levels are high, its target genes should be downregulated upon restoration of APC activity (APC WT vs Control).
Targets of human transcription factors can be downloaded from TRRUST database [^5^](#5). In the database TAZ appears as WWTR1.
In the TRRUST database, TAZ has 6 annotated targets, however, none of them are in the downregulated genes (nor the upregulated ones). Only one target of TAZ (ASNS) has been found in the not differentially expressed genes. The possible reasons are: (1) very stringent filtering of least variable genes and (2) the TRUST database has very few targets for TAZ.
Hence, we cannot proof the hypothesis that the original study states: TAZ mediates part of the gene expression induced by the Wnt signaling pathway.
Azzolin L, Zanconato F, Bresolin S, Forcato M et al. Role of TAZ as mediator of Wnt signaling. Cell 2012 Dec 21;151(7):1443-56. PMID: 23245942
Gonzalo, Ricardo and Sanchez-Pla, Alex. ASP teaching - Omics_Data_Analysis-Case_Study_1-Microarrays. 2019. Github
Benjamini, Yoav, and Yosef Hochberg. 1995. āControlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.ā Source Journal of the Royal Statistical Society. Series B (Methodological) 57 (1): 289ā300. Link.
Francisco Romero Campero. Microarrays. 2018. Github.
TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Research 26 Oct, 2017 (link)
WORKDIR <- "~/Dropbox/AdriaĢ/ESTUDIS/master_bioinfo_uoc/OmicsDataAnalisis/PEC1/"
setwd(WORKDIR)
dir.create(path = "./results/", showWarnings = F, recursive = T)
dir.create(path = "./figures/", showWarnings = F, recursive = T)
if(!require(Biobase)){ BiocManager::install("Biobase") }; library(Biobase)
#if(!require(pvca)){ BiocManager::install("pvca") }; library(pvca)
if(!require(hgu133plus2.db)){ BiocManager::install("hgu133plus2.db") }; library(hgu133plus2.db) ### gene annotation
if(!require(clusterProfiler)){ BiocManager::install("clusterProfiler") }; library(clusterProfiler)
if(!require(openxlsx)){ install.packages("openxlsx") }; library(openxlsx)
library(affy)
library(affyPLM)
library(simpleaffy)
library(limma)
library(genefilter)
library(ggplot2)
library(ggrepel)
library(tidyr)
library(magrittr)
library(dplyr)
library(AnnotationDbi)
We must download the data, store the .cel files in the ādataā directory, create a pheno file where we can store the information (name, conditions, group, etc) of each .cel file and then, use this information to read the data.
targets <- Biobase::read.AnnotatedDataFrame(filename = "pheno.tsv", sep = "\t", header = T, row.names = 1)
rawData <- affy::ReadAffy(celfile.path = "./data/", verbose = T, phenoData = targets, sampleNames = c(targets@data$ShortName))
targets@data$ShortName -> rownames(pData(rawData))
sampleID <- targets@data$ShortName
write.csv(x = exprs(rawData), file = "./results/Raw_expr_data.tsv", sep = "\t")
In the quality control of the raw data, we must visualize the images taken by the fluorescence scanner in order to determine if there is any physical damage in the microarrays.
for(i in 1:16){ affy::image(rawData[,i], col = rainbow(100)) }
It looks like that all the microarrays are in a good state, so it seems that there is not physical damage and we are able to continue with the analyses.
Using the function qc() from the package simpleaffy we can compute some quality statistics for each sample.
qc <- simpleaffy::qc(rawData)
plot(qc)
After plotting the qc() output, we obtain a figure where each row corresponds to a sample.
In each row there is the sample name in black and two numbers which, in ourn case, are in blue (if they werenāt correct or they were problematic, they would have been red). There are also an empty triangle and an empty circle.
Another thing we should do is drawing a boxplot with the luminiscence of the probes (proportional gene expression) in each sample.
One assumption is that the levelsof the expression (mean, median, etc) should be similar across all samples.
affy::boxplot(rawData, col=c(rep("red",4), rep("blue",4), rep("green",4), rep("yellow",4)), las=2, ylab="Luminescence")
As it can be clearly seen in the figure above, the expression across the different samples is very different and we cannot compare one sample to another. This tells us that we must do a normalization step.
affy::hist(rawData,col=rainbow(16))
The function we will use to compute the PCA has been retrieved from the course notes.
plotPCA3 <- function (datos, labels, factor, title, scale, colores, size = 1.5, glineas = 0.25) {
exprs <- affy::exprs(datos)
data <- prcomp(t(exprs), scale = scale)
# plot adjustments
dataDf <- data.frame(data$x)
dataDf$cond <- factor
loads <- round(data$sdev^2/sum(data$sdev^2)*100, 1)
# main plot
p1 <- ggplot(dataDf, aes(x=PC1, y=PC2)) +
geom_point(mapping = aes(x = PC1, y = PC2, fill = cond, color = cond), size = 4) +
geom_hline(yintercept = 0, color = "gray70") +
geom_vline(xintercept = 0, color = "gray70") #+
#coord_cartesian(xlim = c(min(data$x[,1])-5, max(data$x[,1])+5))
# avoiding labels superposition
p1 + geom_text_repel(aes(y = PC2 + 0.25, label = labels), segment.size = 0.25, size = size) +
labs(x = c(paste("PC1", loads[1], "%")),y=c(paste("PC2", loads[2], "%"))) +
ggtitle(paste("Principal Component Analysis for: ", title, sep = " "))+
theme(plot.title = element_text(hjust = 0.5),
legend.title = element_blank(),
axis.text = element_text(size = 10)) +
scale_color_manual(aesthetics = c("colour", "fill"), values=colores)
}
plotPCA3(rawData, labels = targets$ShortName, factor = targets@data$Condition,
title="Raw data", scale = FALSE, size = 3,
colores = c("red", "blue", "green", "yellow"))
Looking at the PCA, we can study the variance across all samples. As it can be observed in the figure above, the samples have great variance between them, with the PC1 explaining 55% of the variance and PC2 explaining 19%.
However, the most worrying thing is that the replicates within each condition are very different from the other intra-condition replicates. Even in some cases, samples from other conditions are closer to some replicates that the samples from the same conditions. This is the case of the groups treated with siGFP (control) and siBCAT (bCAT KD). Contrarily, the replicates of the group treated with siTAZ cluster relativeley close between each other and the replicates from the APC-WT group have bigger variance but are far from the other conditions.
To process the raw data, we must use a Robust Multiarray Analysis (RMA), which performs in three steps:
To perform a RMA, we can use the rma() function from the package affy.
# performing the RMA
normData <- affy::rma(object = rawData, normalize = T)
# saving the processed data
write.csv(x = exprs(normData), file = "./results/Norm_expr_data.tsv", sep = "\t")
Now, to see if the preprocessing has improved the data, we must do some quality controls more.
affy::boxplot(normData, col=c(rep("red",4), rep("blue",4), rep("green",4), rep("yellow",4)), ylab="Luminescence", las = 2)
Boxplot of luminiscence distribution for all samples after RMA processing.
After the RMA preprocessing of the raw data, the levels of expression (luminsicence) across all samples should be comparable.
affy::hist(normData, col=rainbow(16))
As before, the principal component analysis shows the difference between samples.
plotPCA3(normData, labels = targets@data$ShortName, factor = targets@data$Condition,
title="Preprocessed data", scale = FALSE, size = 3,
colores = c("red", "blue", "green", "yellow"))
Principal component analysis of RMA-processed data
After the RMA of the raw data, all intra-condition samples cluster perfectly together while the four conditions cluster separate from each other.
When we do a differential expression analysis, the number of genes on which we do it has a very important role. The more genes we have, the more statistical tests we do to see if each gene is differentially expressed between groups, hence, whitout taking real differnces into account, a lot of tests will result in a significant p-value only by chance, giving more false positives, and this number of false positives increases with the number of genes.
sds <- apply (exprs(normData), 1, sd)
sdsO<- sort(sds)
plot(1:length(sdsO), sdsO, main="Distribution of variability for all genes",
sub="Vertical lines represent 90% and 95% percentiles",
xlab="Gene index (from least to most variable)", ylab="Standard deviation")
abline(v=length(sds)*c(0.9,0.95), lty = 2)
The differentially expressed genes are expected to be varable between groups or conditions thus, the least variable genes across all samples can be considered as non differentially expressed genes.
For this reason, apart from correcting the p-values of the differential expression analysis, we should do a filtering of the data. To do so, we can use the function nsFilter() from the package genefilter.
annotation(normData) <- "hgu133plus2.db"
filtered <- nsFilter(normData,
require.entrez = TRUE, remove.dupEntrez = TRUE,
var.filter=TRUE, var.func=IQR, var.cutoff=0.75,
filterByQuantile=TRUE, feature.exclude = "^AFFX")
filtData <- filtered$eset
write.csv(x = filtData, file = "./results/Filt_expr_data.tsv", sep = "\t")
Number of genes that remained after filtering.5044
save(normData, filtData, file="./results/normData.Rda")
The first thing we should do is getting the expression levels from the filtered data with the function exprs() from the package affy. This will gives us the RMA-processed expression levels for each gene in each sample.
expr_level <- affy::exprs(filtData)
head(expr_level)
To define the experimental desing with the function model.matrix() in order to see which samples go to which condition.
exp_design <- model.matrix(~ -1+factor(c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)))
colnames(exp_design) <- c("Control","siTAZ","siBCAT","APC_WT")
print("Experimental desing: which samples go in each condition")
## [1] "Experimental desing: which samples go in each condition"
exp_design %>% set_rownames(targets@data$ShortName)
## Control siTAZ siBCAT APC_WT
## siGFP_repA 1 0 0 0
## siGFP_repB 1 0 0 0
## siGFP_repC 1 0 0 0
## siGFP_repD 1 0 0 0
## siTAZ_repA 0 1 0 0
## siTAZ_repB 0 1 0 0
## siTAZ_repC 0 1 0 0
## siTAZ_repD 0 1 0 0
## siBCAT_repA 0 0 1 0
## siBCAT_repB 0 0 1 0
## siBCAT_repC 0 0 1 0
## siBCAT_repD 0 0 1 0
## APCWT_repA 0 0 0 1
## APCWT_repB 0 0 0 1
## APCWT_repC 0 0 0 1
## APCWT_repD 0 0 0 1
## attr(,"assign")
## [1] 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4))`
## [1] "contr.treatment"
To perform the differential expression (DE) analysis and see which genes are up or down regulated, we must state the comparisions between conditions we want to study:
To do so, we must create a contrast matrix with the function makeContrasts() from the package limma.
contrast.matrix <- limma::makeContrasts(APC_WT-Control,
siTAZ-Control,
siBCAT-Control,
siTAZ-APC_WT,
siBCAT-APC_WT,
siBCAT-siTAZ,
levels=c("Control","siTAZ","siBCAT","APC_WT"))
To continue with the DE analysis, we have to fit a regression model with the function lmFit() from limma.
lmfit <- limma::lmFit(expr_level, exp_design)
Then, with the function contrasts.fit() we take the model generated in the last step and compute the estimated coefficients and the standard errors for the contrasts in our contrast matrix. Then, with the function eBayes we compute the t- and F-statistic, etc.
contrast.fit <- limma::contrasts.fit(lmfit, contrast.matrix)
fit.results <- limma::eBayes(contrast.fit)
To extract the differentially expressed genes (DEGs) of each performed contrast, we must run the function topTable() or topTreat() from limma. It is important to indicate an adjustment method for the p-value. In this case we will adjust the p-value using the Benjamini-Hochberg procedure.
Thresholds for adjusted p-value and logFC will also be set in order to call de differentially expressed genes:
apcwt_vs_control <- topTable(fit = fit.results, number=nrow(fit.results), coef=1, adjust="BH", sort.by = "logFC")
print("APC wild-type vs Control")
## [1] "APC wild-type vs Control"
apcwt_vs_control %>% head()
## logFC AveExpr t P.Value adj.P.Val B
## 206619_at 6.494847 7.307492 52.00683 6.883265e-18 2.479942e-15 31.46300
## 211990_at 6.489434 6.383802 69.17250 1.119094e-19 1.725649e-16 35.35426
## 210982_s_at 5.976017 6.425290 55.86770 2.449267e-18 1.372678e-15 32.46221
## 209656_s_at -5.495125 8.907695 -60.93045 6.999387e-19 5.043558e-16 33.65409
## 207529_at 5.083690 5.277854 38.64819 4.944888e-16 5.306812e-14 27.20950
## 226281_at 4.971714 4.167980 44.24159 7.077826e-17 1.487523e-14 29.16487
tazkd_vs_control <- topTable(fit = fit.results, number=nrow(fit.results), coef=2, adjust="BH", sort.by = "logFC")
print("TAZ knock-down vs Control")
## [1] "TAZ knock-down vs Control"
tazkd_vs_control %>% head()
## logFC AveExpr t P.Value adj.P.Val B
## 209108_at -3.825852 8.481125 -61.32430 6.377359e-19 1.608370e-15 33.33589
## 202133_at -3.373756 9.791097 -46.07432 3.944941e-17 2.842612e-14 29.61517
## 226415_at 3.280943 5.894784 26.65565 1.004462e-13 1.151479e-11 21.88405
## 208025_s_at -2.922734 9.627189 -64.67851 2.954681e-19 1.490341e-15 33.99026
## 224964_s_at 2.870532 6.241654 30.10781 1.770113e-14 3.571379e-12 23.64486
## 209094_at -2.781102 10.231877 -56.59273 2.033308e-18 3.418669e-15 32.32407
bcatkd_vs_control <- topTable(fit = fit.results, number=nrow(fit.results), coef=3, adjust="BH", sort.by = "logFC")
print("bCAT knock-down vs Control")
## [1] "bCAT knock-down vs Control"
bcatkd_vs_control %>% head()
## logFC AveExpr t P.Value adj.P.Val B
## 1553602_at -4.292049 8.502158 -33.15276 4.462801e-15 7.034490e-13 25.01363
## 210004_at -4.207942 5.550918 -41.50217 1.775714e-16 5.971135e-14 28.23536
## 212942_s_at -3.819106 8.304102 -43.91733 7.868535e-17 3.608081e-14 29.03271
## 200783_s_at -3.561840 9.155150 -29.33829 2.561747e-14 2.227836e-12 23.23413
## 232523_at 3.521103 6.460373 27.53653 6.323579e-14 3.987017e-12 22.30689
## 1555935_s_at -3.416338 8.876178 -51.62705 7.650795e-18 5.652909e-15 31.27041
tazkd_vs_apcwt <- topTable(fit = fit.results, number=nrow(fit.results), coef=4, adjust="BH", sort.by = "logFC")
print("TAZ knock-down vs APC wild-type")
## [1] "TAZ knock-down vs APC wild-type"
tazkd_vs_apcwt %>% head()
## logFC AveExpr t P.Value adj.P.Val B
## 206619_at -6.781118 7.307492 -54.29911 3.694322e-18 1.035231e-15 32.07825
## 209656_s_at 6.154647 8.907695 68.24329 1.360724e-19 1.538498e-16 35.20664
## 210982_s_at -5.740703 6.425290 -53.66783 4.373393e-18 1.161021e-15 31.91431
## 236646_at 5.632541 5.914005 30.70827 1.334979e-14 4.411493e-13 23.82275
## 211990_at -5.552152 6.383802 -59.18178 1.065872e-18 5.376258e-16 33.27470
## 207529_at -5.000366 5.277854 -38.01473 6.269760e-16 4.650686e-14 26.96207
bcatkd_vs_apcwt <- topTable(fit = fit.results, number=nrow(fit.results), coef=5, adjust="BH", sort.by = "logFC")
print("bCAT knock-down vs APC wild-type")
## [1] "bCAT knock-down vs APC wild-type"
bcatkd_vs_apcwt %>% head()
## logFC AveExpr t P.Value adj.P.Val B
## 206619_at -8.215200 7.307492 -65.78238 2.313682e-19 2.917553e-16 34.65097
## 211990_at -5.875408 6.383802 -62.62745 4.706745e-19 3.956804e-16 33.99587
## 207529_at -5.498924 5.277854 -41.80496 1.599419e-16 2.240964e-14 28.35033
## 210982_s_at -5.256350 6.425290 -49.13979 1.559270e-17 6.554133e-15 30.65439
## 226281_at -5.219382 4.167980 -46.44551 3.514309e-17 8.057353e-15 29.85636
## 225681_at 4.655156 10.238208 48.29636 2.001245e-17 6.840064e-15 30.41009
bcatkd_vs_tazkd <- topTable(fit = fit.results, number=nrow(fit.results), coef=6, adjust="BH", sort.by = "logFC")
print("bCAT knock-down vs TAZ knock-down")
## [1] "bCAT knock-down vs TAZ knock-down"
bcatkd_vs_tazkd %>% head()
## logFC AveExpr t P.Value adj.P.Val B
## 1553602_at -5.243781 8.502158 -40.50415 2.519959e-16 4.220467e-14 27.89416
## 236646_at -4.743819 5.914005 -25.86301 1.542818e-13 4.786884e-12 21.31469
## 210004_at -4.507458 5.550918 -44.45624 6.601417e-17 1.447719e-14 29.23280
## 208025_s_at 4.362499 9.627189 96.53971 9.004242e-22 4.541740e-18 39.41577
## 213496_at -4.095674 4.633688 -42.26309 1.367309e-16 2.554336e-14 28.50711
## 219619_at -4.082648 5.957833 -28.73830 3.440027e-14 1.508826e-12 22.87204
getDegs <- function(df, pval = 0.05, log2FC = 1, db = hgu133plus2.db) {
#load package ggplot2
require(ggplot2)
#stablish degs
df$DEG <- rep("NotDE", length(df$logFC))
for(i in 1:length(df$logFC)){
if(df$adj.P.Val[i]<0.05 & df$logFC[i] > 1){df$DEG[i] = "Upregulated"}
else if(df$adj.P.Val[i]<0.05 & df$logFC[i] < -1){df$DEG[i] = "Downregulated"}
}
#retrieve gene symbols from probe IDs
geneSymbols <- AnnotationDbi::select(x = db, rownames(df), c("SYMBOL", "ENTREZID", "GENENAME"))
df$symbol <- geneSymbols$SYMBOL
df$entrez <- geneSymbols$ENTREZID
df$genename <- geneSymbols$GENENAME
df$probeid <- rownames(df)
#arrange data by p-value
df = df %>% dplyr::arrange(adj.P.Val) %>% set_rownames(df$probeid)
}
apcwt_vs_control2 <- getDegs(apcwt_vs_control)
tazkd_vs_control2 <- getDegs(tazkd_vs_control)
bcatkd_vs_control2 <- getDegs(bcatkd_vs_control)
tazkd_vs_apcwt2 <- getDegs(tazkd_vs_apcwt)
bcatkd_vs_apcwt2 <- getDegs(bcatkd_vs_apcwt)
bcatkd_vs_tazkd2 <- getDegs(bcatkd_vs_tazkd)
dir.create("results/degs/", showWarnings = F, recursive = T)
write.csv(x = apcwt_vs_control2, file = "results/degs/apcwt_vs_control.tsv", sep = "\t")
write.csv(x = tazkd_vs_control2, file = "results/degs/tazkd_vs_control.tsv", sep = "\t")
write.csv(x = bcatkd_vs_control2, file = "results/degs/bcatkd_vs_control.tsv", sep = "\t")
write.csv(x = tazkd_vs_apcwt2, file = "results/degs/tazkd_vs_apcwt.tsv", sep = "\t")
write.csv(x = bcatkd_vs_apcwt2, file = "results/degs/bcatkd_vs_apcwt.tsv", sep = "\t")
write.csv(x = bcatkd_vs_tazkd2, file = "results/degs/bcatkd_vs_tazkd.tsv", sep = "\t")
# we can create new dataframes for the up and down degs of each contrast
subset_degs <- function(input, deg = "Upregulated"){
subset <- data.frame()
for(i in 1:length(input[,1])) {
if(input$DEG[i] == deg) {
subset <- rbind(subset, input[i,])
}
}
return(subset)
}
#Upregulated genes
apcwt_vs_control_up <- subset_degs(apcwt_vs_control2, deg = "Upregulated")
tazkd_vs_control_up <- subset_degs(tazkd_vs_control2, deg = "Upregulated")
bcatkd_vs_control_up <- subset_degs(bcatkd_vs_control2, deg = "Upregulated")
tazkd_vs_apcwt_up <- subset_degs(tazkd_vs_apcwt2, deg = "Upregulated")
bcatkd_vs_apcwt_up <- subset_degs(bcatkd_vs_apcwt2, deg = "Upregulated")
bcatkd_vs_tazkd_up <- subset_degs(bcatkd_vs_tazkd2, deg = "Upregulated")
write.csv(x = apcwt_vs_control_up, file = "results/degs/apcwt_vs_control_up.tsv", sep = "\t")
write.csv(x = tazkd_vs_control_up, file = "results/degs/tazkd_vs_control_up.tsv", sep = "\t")
write.csv(x = bcatkd_vs_control_up, file = "results/degs/bcatkd_vs_control_up.tsv", sep = "\t")
write.csv(x = tazkd_vs_apcwt_up, file = "results/degs/tazkd_vs_apcwt_up.tsv", sep = "\t")
write.csv(x = bcatkd_vs_apcwt_up, file = "results/degs/bcatkd_vs_apcwt_up.tsv", sep = "\t")
write.csv(x = bcatkd_vs_tazkd_up, file = "results/degs/bcatkd_vs_tazkd_up.tsv", sep = "\t")
#Downregulated genes
apcwt_vs_control_down <- subset_degs(apcwt_vs_control2, deg = "Downregulated")
tazkd_vs_control_down <- subset_degs(tazkd_vs_control2, deg = "Downregulated")
bcatkd_vs_control_down <- subset_degs(bcatkd_vs_control2, deg = "Downregulated")
tazkd_vs_apcwt_down <- subset_degs(tazkd_vs_apcwt2, deg = "Downregulated")
bcatkd_vs_apcwt_down <- subset_degs(bcatkd_vs_apcwt2, deg = "Downregulated")
bcatkd_vs_tazkd_down <- subset_degs(bcatkd_vs_tazkd2, deg = "Downregulated")
write.csv(x = apcwt_vs_control_down, file = "results/degs/apcwt_vs_control_down.tsv", sep = "\t")
write.csv(x = tazkd_vs_control_down, file = "results/degs/tazkd_vs_control_down.tsv", sep = "\t")
write.csv(x = bcatkd_vs_control_down, file = "results/degs/bcatkd_vs_control_down.tsv", sep = "\t")
write.csv(x = tazkd_vs_apcwt_down, file = "results/degs/tazkd_vs_apcwt_down.tsv", sep = "\t")
write.csv(x = bcatkd_vs_apcwt_down, file = "results/degs/bcatkd_vs_apcwt_down.tsv", sep = "\t")
write.csv(x = bcatkd_vs_tazkd_down, file = "results/degs/bcatkd_vs_tazkd_down.tsv", sep = "\t")
To visualize the DEGs we will use volcano plots, which are dotplots that draw the log2FC in the X axis and the -log10(pvalue) in the Y axis. The downregulated genes are in the left part and the upregulated ones are in the right part, making the plot look like a volcano.
To draw the volcano plots, the custom function volcanoPlot2() will be used. The function counts the upregulated and downregulated genes. It also annotates the name of the 5 most upregulated and 5 most downregulated genes, respectively.
volcanoPlot2 <- function(df, xlim = c(-6,6), ylim = c(0,16), main = NULL, title_size = 15, labelSize = 10,
pval = 0.05, log2FC = 1, xlab = "log2(FC)", ylab = "-log10(pval)", axis_label_size = 10,
point.color = c("darkgreen", "gray", "red"), legend_pos = "bottom", db = hgu133plus2.db) {
#load package ggplot2
require(ggplot2)
#retrieve gene symbols from probe IDs
geneSymbols <- AnnotationDbi::select(x = db, rownames(df), c("SYMBOL"))
df$symbol <- geneSymbols$SYMBOL
#stablish degs
df$DEG <- rep("NotDE", length(df$logFC))
for(i in 1:length(df$logFC)){
if(df$adj.P.Val[i]<0.05 & df$logFC[i] > 1){df$DEG[i] = "Upregulated"}
else if(df$adj.P.Val[i]<0.05 & df$logFC[i] < -1){df$DEG[i] = "Downregulated"}
}
#arrange data by p-value
df = df %>% dplyr::arrange(logFC)
df2 = rbind(head(df,5), tail(df,5)) %>% as.data.frame()
#generate the graph with ggplot(), stablish the foldchange and pvalue to colour the DEGs and stablish the point format.
p <- ggplot(data = df, aes(x=logFC, y=-log10(adj.P.Val), colour=DEG) ) +
geom_point(alpha=0.7, size=1.5) +
geom_text_repel(data = df2, mapping = aes(x = logFC, y = -log10(adj.P.Val), label = symbol), size = 5, color = "Black") +
#annotate the number of up and downregulated DEGs
annotate("text", label = sum(df$DEG == "Upregulated"), color = "gray35", y = ylim[1], x = xlim[2],
vjust="inward",hjust="inward", size = labelSize) +
annotate("text", label = sum(df$DEG == "Downregulated"), color = "gray35", y = ylim[1], x = xlim[1],
vjust="inward",hjust="inward", size = labelSize) +
#stablish a predefined theme
theme_classic() +
#write and format the graph title, can be nothing.
ggtitle(main) +
theme(plot.title = element_text(face="bold", hjust = .5, size = title_size)) +
#stablish the x and y axes ranges.
coord_cartesian(xlim = xlim, ylim = ylim) +
#put an horizontal line in the -log10(pval) value and two vertival lines in the -logFC and logFC values.
geom_hline(yintercept = -log10(pval), linetype = 2) +
geom_vline(xintercept = c(-log2FC, log2FC), linetype = 2) +
#format the axis names and sizes
xlab(xlab) + ylab(ylab) + theme(axis.title = element_text(size = axis_label_size)) +
scale_colour_manual(values=point.color) +
#decide the position of the legend (default: "bottom")
theme(legend.position = legend_pos, legend.title = element_blank(), legend.text = element_text(size = 10))
#draw the graph.
p
}
volcanoPlot2(apcwt_vs_control, main = "DE of: APC Wildtype vs Control")
volcanoPlot2(tazkd_vs_control, main = "DE of: TAZ Knockdown vs Control")
volcanoPlot2(bcatkd_vs_control, main = "DE of: bCAT Knockdown vs Control")
volcanoPlot2(tazkd_vs_apcwt, main = "DE of: TAZ Knockdown vs APC Wildtype")
volcanoPlot2(bcatkd_vs_apcwt, main = "DE of: bCAT Knockdown vs APC Wildtype")
volcanoPlot2(bcatkd_vs_tazkd, main = "DE of: bCAT Knockdown vs TAZ Knockdown")
To do the GO analysis, we will use the package clusterProfiler(). We will only look for the GO terms related to biological functions.
GOenrich <- function(data, db=hgu133plus2.db, ont="BP", pAdjust="BH", pval=0.05, qval=0.1){
go <- clusterProfiler::enrichGO(gene = data$entrez,
OrgDb = db,
keyType = "ENTREZID",
ont = ont,
pvalueCutoff = pval,
pAdjustMethod = pAdjust,
qvalueCutoff = qval,
readable = TRUE)
}
apc_control_up_go <- GOenrich(apcwt_vs_control_up)
taz_control_up_go <- GOenrich(tazkd_vs_control_up)
bcat_control_up_go <- GOenrich(bcatkd_vs_control_up)
taz_apc_up_go <- GOenrich(tazkd_vs_apcwt_up)
bcat_apc_up_go <- GOenrich(bcatkd_vs_apcwt_up)
bcat_taz_up_go <- GOenrich(bcatkd_vs_tazkd_up)
apc_control_down_go <- GOenrich(apcwt_vs_control_down)
taz_control_down_go <- GOenrich(tazkd_vs_control_down)
bcat_control_down_go <- GOenrich(bcatkd_vs_control_down)
taz_apc_down_go <- GOenrich(tazkd_vs_apcwt_down)
bcat_apc_down_go <- GOenrich(bcatkd_vs_apcwt_down)
bcat_taz_down_go <- GOenrich(bcatkd_vs_tazkd_down)
list <- list("GOterms_BF_APC-vs-Control_Up" = apc_control_up_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_siTAZ-vs-Control_Up" = taz_control_up_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_sibCAT-vs-Control_Up" = bcat_control_up_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_siTAZ-vs-APC_Up" = taz_apc_up_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_sibCAT-vs-APC_Up" = bcat_apc_up_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_sibCAT-vs-siTAZ_Up" = bcat_taz_up_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_APC-vs-Control_Down" = apc_control_down_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_siTAZ-vs-Control_Down" = taz_control_down_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_sibCAT-vs-Control_Down" = bcat_control_down_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_siTAZ-vs-APC_Down" = taz_apc_down_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_sibCAT-vs-APC_Down" = bcat_apc_down_go@result %>% filter(p.adjust < 0.05) %>% add_row(),
"GOterms_BF_sibCAT-vs-siTAZ_Down" = bcat_taz_down_go@result %>% filter(p.adjust < 0.05) %>% add_row())
dir.create(path = "results/degs/GO", showWarnings = F, recursive = T)
openxlsx::write.xlsx(x = list, file = "results/degs/GO/DEGs_GOterms_BF.xlsx")
GOup <- as.data.frame(matrix(c(apcwt_vs_control_up$entrez, tazkd_vs_control_up$entrez,
bcatkd_vs_control_up$entrez, tazkd_vs_apcwt_up$entrez,
bcatkd_vs_apcwt_up$entrez, bcatkd_vs_tazkd_up$entrez))) %>%
dplyr::mutate(Group = c(rep("APCWT vs Control", length(apcwt_vs_control_up$entrez)),
rep("TAZKD vs Control", length(tazkd_vs_control_up$entrez)),
rep("bCATKD vs Control", length(bcatkd_vs_control_up$entrez)),
rep("TAZKD vs APC WT", length(tazkd_vs_apcwt_up$entrez)),
rep("bCATKD vs APC WT", length(bcatkd_vs_apcwt_up$entrez)),
rep("bCATKD vs TAZ KD", length(bcatkd_vs_tazkd_up$entrez)))) %>%
set_colnames(c("ENTREZID", "Group"))
GOup.dot <- compareCluster(geneCluster = ENTREZID~Group,
data = GOup, fun = "enrichGO",
OrgDb = hgu133plus2.db,
keyType = 'ENTREZID', ont = "BP",
pAdjustMethod = "BH", pvalueCutoff = 0.05,
qvalueCutoff = 0.1, readable = TRUE)
enrichplot::dotplot(object = GOup.dot, title = "UP DEG-related GO terms",
showCategory = 5, color = "p.adjust", font.size = 7, split = T) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 0.9))
GOdown <- as.data.frame(matrix(c(apcwt_vs_control_down$entrez, tazkd_vs_control_down$entrez,
bcatkd_vs_control_down$entrez, tazkd_vs_apcwt_down$entrez,
bcatkd_vs_apcwt_down$entrez, bcatkd_vs_tazkd_down$entrez))) %>%
dplyr::mutate(Grodown = c(rep("APCWT vs Control", length(apcwt_vs_control_down$entrez)),
rep("TAZKD vs Control", length(tazkd_vs_control_down$entrez)),
rep("bCATKD vs Control", length(bcatkd_vs_control_down$entrez)),
rep("TAZKD vs APC WT", length(tazkd_vs_apcwt_down$entrez)),
rep("bCATKD vs APC WT", length(bcatkd_vs_apcwt_down$entrez)),
rep("bCATKD vs TAZ KD", length(bcatkd_vs_tazkd_down$entrez)))) %>%
set_colnames(c("ENTREZID", "Group"))
GOdown.dot <- compareCluster(geneCluster = ENTREZID~Group,
data = GOdown, fun = "enrichGO",
OrgDb = hgu133plus2.db,
keyType = 'ENTREZID', ont = "BP",
pAdjustMethod = "BH", pvalueCutoff = 0.05,
qvalueCutoff = 0.1, readable = TRUE)
enrichplot::dotplot(object = GOdown.dot, title = "DOWN DEG-related GO terms",
showCategory = 5, color = "p.adjust", font.size = 7, split = F) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.9))